This section again considers sentence similarity. The approach parallels 4.7.1.1, but uses a hash value of each sentence instead of the sentence signature.
First we count the distinct hash values and compare that count with the total number of sentences; then we list the 20 hash values with the highest multiplicities.
First table:
select aa.a, bb.b, aa.a/bb.b from (select count(distinct hash) as a from para_s) aa, (select count(*) as b from para_s) bb;
Second table:
select hash, count(*) as anz from para_s group by hash order by anz desc limit 20;
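The two queries above can be tried out on a small toy table. The following is a minimal sketch, not the production setup: it assumes an in-memory SQLite database, a reduced para_s table with a hypothetical s_id column next to hash, and invented hash values. Since SQLite truncates integer division, the ratio is multiplied by 1.0 here; that factor is an adaptation for this sketch and is not part of the original query.

```python
import sqlite3

# Toy stand-in for para_s (assumption: the real table has more columns;
# s_id and the hash values below are invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE para_s (s_id INTEGER PRIMARY KEY, hash TEXT)")
toy_hashes = ["h1", "h2", "h1", "h3", "h1", "h2"]
conn.executemany("INSERT INTO para_s (hash) VALUES (?)",
                 [(h,) for h in toy_hashes])

# First table: distinct hash values, number of sentences, and their ratio.
# The 1.0 * factor forces floating-point division in SQLite.
distinct_hashes, sentences, ratio = conn.execute(
    "SELECT aa.a, bb.b, 1.0 * aa.a / bb.b "
    "FROM (SELECT count(DISTINCT hash) AS a FROM para_s) aa, "
    "(SELECT count(*) AS b FROM para_s) bb"
).fetchone()

# Second table: the (at most) 20 hash values with the highest multiplicity.
top_hashes = conn.execute(
    "SELECT hash, count(*) AS anz FROM para_s "
    "GROUP BY hash ORDER BY anz DESC LIMIT 20"
).fetchall()

print(distinct_hashes, sentences, ratio)
for hash_value, anz in top_hashes:
    print(hash_value, anz)
```

On this toy data there are 3 distinct hash values among 6 sentences, so the ratio is 0.5. A ratio close to 1 means that almost every sentence has its own hash value; a lower ratio means that more sentences share a hash value with others.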
4.7.3.2 Sentences with Most Frequent Hash Values